🐛 FIX: Ensure Container DB always closed after access #5123
Conversation
This certainly removes the warning for open files in … Still not entirely sure how this relates to threads though, but note that the check can be turned on/off at the SQLite level.
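The thread check being referred to is presumably the Python `sqlite3` driver's same-thread guard (an assumption; SQLAlchemy forwards the same flag via `connect_args={'check_same_thread': False}`). A minimal sketch of toggling it:

```python
import sqlite3
import threading

# By default, a connection may only be used from the thread that created it;
# check_same_thread=False disables that guard.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE t (x INTEGER)")

def worker():
    # Without check_same_thread=False this would raise ProgrammingError;
    # with it, the caller is responsible for serialising access.
    conn.execute("INSERT INTO t VALUES (1)")

thread = threading.Thread(target=worker)
thread.start()
thread.join()
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])
```

Disabling the check silences the error but does not make concurrent access safe; some external locking is still needed if threads overlap.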
Another consideration here is the close method:

```python
def close(self) -> None:
    """Close open files (in particular, the connection to the SQLite DB)."""
    if self._session is not None:
        self._session.close()
        self._session = None
```

In the code here, we may just want to close the session, but not throw it away (which then requires having to create a whole new engine).
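A sketch of that suggestion, keeping the expensive-to-recreate part cached while only the session is closed on each access. The class and attribute names here are hypothetical, and a plain `sqlite3` connection stands in for the SQLAlchemy session:

```python
import os
import sqlite3
import tempfile

class ContainerSketch:
    """Hypothetical container: cache the engine-like part, recreate sessions cheaply."""

    def __init__(self, db_path):
        self._db_path = db_path  # cached "engine" equivalent (path/config)
        self._session = None     # per-access connection

    @property
    def session(self):
        # Lazily (re)open a session on demand.
        if self._session is None:
            self._session = sqlite3.connect(self._db_path)
        return self._session

    def close(self):
        # Close the session (releasing the file handle), but keep the
        # engine-equivalent so the next access is cheap to set up.
        if self._session is not None:
            self._session.close()
            self._session = None

db_path = os.path.join(tempfile.mkdtemp(), "packs.idx")
container = ContainerSketch(db_path)
container.session.execute("CREATE TABLE IF NOT EXISTS obj (hashkey TEXT)")
container.session.commit()
container.close()            # file handle released here
assert container._session is None
container.session            # lazily reopened on the next access
```

With SQLAlchemy the analogue would be caching the engine and session factory while closing only the session objects, so no new engine has to be built per access.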
(force-pushed from 3383705 to 0e4cf22)
Codecov Report

```
@@            Coverage Diff             @@
##           develop    #5123      +/-  ##
===========================================
+ Coverage    82.06%   82.06%   +0.01%
===========================================
  Files          533      533
  Lines        38296    38307      +11
===========================================
+ Hits         31424    31434      +10
- Misses        6872     6873       +1
```
Trying to understand how this is fixing the problem that you describe. Firstly, I am not sure where in the …
The file is the SQLite database; when you close the session you close the file (well, at least you allow it the possibility to close).
That's the wrong context manager you are looking at; I'm talking about the one on the …
Basically, I believe it is not ideal that we never explicitly close connections to the SQLite DB.
My general feeling is that it would probably be "optimal" to abstract the database connection from the … and just have the container's DB table stored within the Postgres DB. After all, the container is basically useless anyway without the Postgres DB to tell you the (one-to-many) relationship between the hashkeys and the node files.
Did a very quick and dirty test of performance between …

```python
#!/usr/bin/env python
import io

from aiida.orm import Data

count = 100
# Need to store the node to actually use the disk object store, otherwise the sandbox is used
node = Data().store()

def write_files(count):
    for i in range(count):
        filename = f'file{i}.txt'
        node._repository.put_object_from_filelike(io.BytesIO(f'some\ncontent{i}'.encode('utf-8')), filename)

def read_files(count):
    for i in range(count):
        filename = f'file{i}.txt'
        with node.open(filename) as handle:
            handle.read()
```

Timed in IPython:

```python
%timeit write_files(100)
%timeit read_files(100)
```
Yep, I believe writing should not be affected at all, because you are just writing the "loose" file to the container folder (and not touching the database).
It would affect the places where we write directly to pack files though, which would require a database connection. Although currently we only do this in the export command I believe. There we can of course make sure that we perform all operations in a single context manager, so it wouldn't be too problematic.
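The difference can be sketched with a hypothetical context manager around the SQLite index: per-item access reopens the DB for every object, while a bulk (export-style) operation does everything under one open connection. All names here are illustrative, not the disk-objectstore API:

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager

db_path = os.path.join(tempfile.mkdtemp(), "index.sqlite")

@contextmanager
def open_db(path):
    # One connection per `with` block: always closed, even on error.
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()

with open_db(db_path) as conn:
    conn.execute("CREATE TABLE packed (hashkey TEXT)")

hashkeys = [f"hash{i}" for i in range(100)]

# Per-item: reopens the DB for every object (the "super cautious" behaviour).
for key in hashkeys[:50]:
    with open_db(db_path) as conn:
        conn.execute("INSERT INTO packed VALUES (?)", (key,))

# Bulk: a single context manager wraps the whole operation.
with open_db(db_path) as conn:
    conn.executemany("INSERT INTO packed VALUES (?)", [(k,) for k in hashkeys[50:]])

with open_db(db_path) as conn:
    count = conn.execute("SELECT COUNT(*) FROM packed").fetchone()[0]
print(count)  # 100
```

Both paths leave the file handle closed afterwards; the bulk path just pays the open/close cost once instead of once per object.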
One thing to quickly note here: when reading a file, currently a database session is always started, even if the hashkey ends up being related to a "loose" file, since the database is checked first before the loose folder. |
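A sketch of the alternative ordering: check the loose folder first and only touch the DB when the object is not loose. The layout and names below are assumptions for illustration, not the disk-objectstore internals:

```python
import os
import tempfile

loose_dir = tempfile.mkdtemp()
pack_index = {"abc123": b"packed content"}  # stands in for the SQLite pack index

db_opens = 0  # count how often we would have had to open a DB session

def write_loose(hashkey, content):
    with open(os.path.join(loose_dir, hashkey), "wb") as handle:
        handle.write(content)

def read_object(hashkey):
    global db_opens
    # Loose first: no DB session needed for the common loose-object case.
    loose_path = os.path.join(loose_dir, hashkey)
    if os.path.exists(loose_path):
        with open(loose_path, "rb") as handle:
            return handle.read()
    db_opens += 1  # only now "open a session" against the pack index
    return pack_index[hashkey]

write_loose("def456", b"loose content")
assert read_object("def456") == b"loose content"
assert read_object("abc123") == b"packed content"
print(db_opens)  # 1: only the packed object touched the DB
```

As noted further down in the thread, the DB-first ordering may exist for concurrency reasons (packing can move an object while a reader looks for it), so this reordering would need careful thought rather than being a drop-in change.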
Thanks @chrisjsewell for looking into this and @sphuber for testing.
Indeed, that is my main concern (as you both stated, I think, the issue is mostly for a lot of small files).

Related to this PR: if the speed is sub-optimal but still acceptable, and the behaviour is more robust, it's OK to merge for now and, at the same time, open a new issue about checking the performance of this and how to improve it (probably looking into bulk methods). At least we can move on with the release and improve performance later, or add some "bulk" operations where everything is done within the context manager. But I agree that it's sub-optimal to reopen the DB every time; this has a cost, as shown by Sebastiaan (reading becomes even slower than writing).

BTW, can you @sphuber (or @chrisjsewell) run the same test, but pack the loose objects before reading? I think the performance we really care about is for packed objects, as we tell people to pack anyway and there's no way back (and if the performance is OK with packed objects, then we're OK: they can just pack to improve performance). Even so, I guess the slowdown is identical, since the SQLite file needs to be re-opened every time.

A few additional comments just to answer some points:
I see; one needs, of course, to change the code slightly to cache the session. @chrisjsewell if you think this helps, you can try doing it at the disk-objectstore level; I think that, if all tests continue to pass (and performance improves), then we're safe (just open an issue there if you don't want to do it now but still think it can help).
I see the general reasoning, but is this really a problem? I'm not aware of explicit suggestions to close the connection (file) to SQLite (but I didn't check, so maybe I'm wrong). But maybe this interacts with some other things that SQLAlchemy does (connection pooling etc.?). @chrisjsewell just to double check: is this logic going to interact correctly with your new context manager?
I see, and you're probably right for AiiDA. My original design used SQLite to keep it separate from AiiDA, and the "disk" in "disk-objectstore" was meant to mean "stores the state on disk, no need for any running service". For AiiDA, it might be interesting to look at moving everything inside AiiDA, but:
1. there are a few subtleties related e.g. to how transactions are managed in PSQL vs. SQLite (or other things like vacuuming) that would require a bit of thought and testing;
2. one would need to re-benchmark speed;
3. backups of the repository also need a dump of the DB (again, OK for AiiDA, but less so for the library used standalone);
4. by the way, SQLite is even more robust and has fewer issues related to storing files than even PSQL :-D
There may well be a reason for this ordering (I can't remember offhand), but obviously looking in the loose folder first would minimise DB opens. I might need to think a bit more about whether there are more reasons. One is the way the various operations are organised in order to guarantee that concurrent writing and packing from many processes does not create issues (but I would need to think a bit more to give an example where the other order would be problematic; maybe one could change the overall logic to overcome this issue, I don't know).
Let's make a decision on this one. There is one thing I am not clear on yet: @chrisjsewell you claim that this may fix the problem of the warnings of files being accessed on different threads. May I ask why you think this? I don't really see how closing the files would get rid of this. Anyway, if this really does fix the problem, I would be OK with merging this and accepting the performance hit.
@chrisjsewell do you happen to have some simple tests of how much this might affect performance? E.g. storing new nodes (even if this stores loose objects, so might not be affected), loading nodes (after packing their files), or more specifically getting files from nodes.
I have already performed this quick benchmarking and attached the results in this PR (see this comment). Or are you thinking of different tests?
Ah sorry, I had completely forgotten that you had already done the tests; sorry for not checking back in the thread.
(force-pushed from 0e4cf22 to 2144c30)
@giovannipizzi Fine for me to go with that approach. @chrisjsewell what was your original rule of thumb on when to use a context manager when using the …
@chrisjsewell shall we wrap this one up as well before the release? Could you please look at my previous comment?
No, I think it should be fine to use it everywhere, given that it only closes the sqlalchemy database if it opened it. One quick thought: we have

```python
@property
def container(self) -> Container:
    return self._container
```

but do we actually need to expose this publicly, given that it should only ever be used as a contextmanager?
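One way to make the property private while still giving scoped access is to expose the container only through a context-manager method, so the DB is guaranteed to be closed afterwards. A toy sketch (the `open_container` name and the `Container` internals are hypothetical):

```python
from contextlib import contextmanager

class Container:
    """Toy stand-in for disk_objectstore.Container."""
    def __init__(self):
        self.closed = True
    def __enter__(self):
        self.closed = False  # open the sqlite connection here
        return self
    def __exit__(self, *exc):
        self.closed = True   # guarantee the connection is closed

class Repository:
    def __init__(self):
        self._container = Container()  # no public property

    @contextmanager
    def open_container(self):  # hypothetical replacement for `.container`
        with self._container as container:
            yield container

repo = Repository()
with repo.open_container() as container:
    assert not container.closed  # open only inside the block
assert repo._container.closed    # always closed on exit
```

Callers then cannot hold a reference to an open container outside a `with` block, which is exactly the invariant this PR is trying to enforce.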
Good point; the current uses outside of the class seem to be tests in … And a similar point goes for the second test: it is testing export and import of repo files, but that should already be covered elsewhere, so I think that test can also be removed. So if you agree, I will go ahead and remove the public property and update/remove those tests.
yep sure 👍
(force-pushed from 83ae467 to eb4e7ad)

(force-pushed from 5cad5c0 to c9c5afe)
This should not be used externally. The two tests that were using it should also not have to use it directly.
(force-pushed from c9c5afe to 7e0cb19)
Possibly fixes #4899
This is the "super cautious" approach, whereby we ensure the SQLite database is closed after every access. Obviously this may not be performant if you want to perform multiple operations.
Alternatively, we could add `__enter__`/`__exit__` methods to the `AbstractRepositoryBackend`/`Repository` itself, and then move the responsibility of using a context manager to users of it. Possibly changing `Profile.get_repository` to a contextmanager `Profile.with_repository`.
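A sketch of that alternative: `__enter__`/`__exit__` on the backend, and a `with_repository` contextmanager on the profile. Beyond the class/method names quoted above, all details here are hypothetical:

```python
from contextlib import contextmanager

class AbstractRepositoryBackend:
    """Sketch: the backend itself becomes a context manager."""
    def __init__(self):
        self._db_open = False
    def __enter__(self):
        self._db_open = True   # open the sqlite connection here
        return self
    def __exit__(self, *exc):
        self._db_open = False  # ...and guarantee it is closed here

class Profile:
    @contextmanager
    def with_repository(self):
        # Replaces a plain `get_repository()`: callers now scope their
        # access explicitly, but closing is guaranteed on exit.
        backend = AbstractRepositoryBackend()
        with backend:
            yield backend

profile = Profile()
with profile.with_repository() as repo:
    assert repo._db_open   # DB open only inside the block
```

This trades the per-access open/close cost of the "super cautious" approach for an explicit scope chosen by the caller, so bulk operations naturally share one open connection.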